Abstract 

We give a universal kernel that renders all the regular languages linearly 
separable. We are not able to compute this kernel efficiently and conjecture 
that it is intractable, but we do have an efficient $\epsilon$-approximation. 

1 Background 

Since the advent of Support Vector Machines (SVMs), kernel methods have 
flourished in machine learning theory [7]. Formally, a kernel is a positive definite 
function from $X \times X$ to $\mathbb{R}$, which, via Mercer's theorem, endows an abstract 
set with the structure of a Hilbert space. Kernels provide both computational 
and theoretical power. The so-called kernel trick, when available, allows us to 
bypass computing the explicit feature-space embedding $\phi$ of $X$ via the 
identity $K(x, y) = \langle \phi(x), \phi(y) \rangle$; this can lead to a considerable gain in efficiency. 
On a more conceptual level, imposing an inner product space structure on an 
abstract set allows us to harness the theoretical and computational utility of 
linear algebra and convex optimization. 

A concrete example where kernel methods provide a palpable advantage over 
more direct approaches is that of learning finite automata from labeled strings. 
Indeed, the most obvious way to infer a DFA from such a sample is to build 
the smallest automaton that accepts all the positive strings and none of the 
negative ones. A straightforward "Occam's Razor" argument [4, Theorem 2.1] 
shows that with this strategy a number of samples polynomial in $1/\epsilon$, $1/\delta$ 
and the target automaton size is sufficient to ensure a generalization error of 
no more than $\epsilon$ with confidence at least $1 - \delta$. Of course, there has to be a 
catch: finding the smallest automaton consistent with a set of accepted and 
rejected strings was shown to be NP-complete by Angluin [1] and Gold [3]; this 
was further strengthened in the hardness of approximation result of Pitt and 
Warmuth [6]. 

In [5], Kontorovich, Cortes and Mohri proposed an alternate framework 
for learning regular languages. Strings are embedded in a high-dimensional 
space and language induction is achieved by constructing a maximum-margin 
hyperplane. This hinges on every language in a family of interest being linearly 
separable under the embedding, and on the efficient computability of the kernel. 
This line of research is continued in [2], where linear separability properties of 
rational kernels are investigated. 

In this paper, we give a universal kernel that renders all the regular languages 
linearly separable. Any linearly separable language necessarily has a positive 
margin, and standard generalization guarantees apply; see [5] for details. We are 
not able to compute this kernel efficiently and conjecture that it is intractable, 
but we do have an efficient $\epsilon$-approximation. Even with these limitations, it 
appears that the technique we propose is the first tool to tackle unsupervised 
learning of unrestricted regular languages. 

2 Linearly separable concept classes 

Let $C$ be a countable concept class defined over a countable set $X$. We will 
say that a concept $c \in C$ is finitely linearly separable if there exists a mapping 
$\phi : X \to \{0,1\}^{\mathbb{N}}$ and a weight vector $w \in \mathbb{R}^{\mathbb{N}}$, both with finite support, i.e., 
$\|w\|_0 < \infty$ and $\|\phi(x)\|_0 < \infty$ for all $x \in X$, such that 

\[ c = \{ x \in X : \langle w, \phi(x) \rangle > 0 \}. \]

The concept class $C$ is said to be finitely linearly separable if all $c \in C$ are finitely 
linearly separable under the same mapping $\phi$. 

Note that the condition $\|\phi(\cdot)\|_0 < \infty$ is important; otherwise, we could define 
the embedding by concept $\phi : X \to \{0,1\}^C$, 

\[ [\phi(x)]_c = 1_{\{x \in c\}}, \qquad c \in C, \]

and for any target $\hat{c} \in C$, 

\[ w_c = 1_{\{c = \hat{c}\}}. \]

This construction trivially ensures that 

\[ \langle w, \phi(x) \rangle = 1_{\{x \in \hat{c}\}}, \qquad x \in X \]

(another reason to require $\|\phi(\cdot)\|_0 < \infty$ is that it automatically makes the kernel 
$K(x,y) = \langle \phi(x), \phi(y) \rangle$ well-defined for all $x, y \in X$). 

Similarly, we disallow $\|w\|_0 = \infty$ due to the algorithmic impossibility of storing 
infinitely many numbers and also because it leads to the trivial construction, 
via embedding by instance: 

\[ [\phi(x)]_u = 1_{\{x = u\}}, \qquad u \in X, \]

and for any target $\hat{c} \in C$, 

\[ w_u = 1_{\{u \in \hat{c}\}}. \]

This again ensures $\langle w, \phi(x) \rangle = 1_{\{x \in \hat{c}\}}$ without doing anything interesting or 
useful. 

1 Throughout this paper, we index vectors by integers or members of other countable sets, 
as dictated by convenience. 

In light of the examples above, from now on when we speak of linear 
separability of a concept class, we shall always assume that $X$ and $C$ are countable and 
that $w$ and $\phi(\cdot)$ have finite support. An immediate question is whether every 
concept class is linearly separable in this sense. A positive answer would require 
a construction of the requisite $\phi$ given $X$ and $C$; a negative answer would entail 
an example of $X$ and $C$ for which no such embedding exists. 

3 Every concept class is linearly separable 

In this section we give an affirmative answer to the question raised in Sec. 2. 

Theorem 3.1. Every countable concept class C over a countable instance space 
X is linearly separable. 

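Proof. Let $C$ be a countable concept class over the countable instance space $X$. 
Define two size functions on $X$ and $C$: 

\[ |\cdot| : X \to \mathbb{N}, \qquad \|\cdot\| : C \to \mathbb{N}, \]

with the property that each has finite level sets (that is, 
$|\{x \in X : |x| = n\}| < \infty$ and $|\{c \in C : \|c\| = n\}| < \infty$ for each $n \in \mathbb{N}$); 
in words, there are at most finitely many elements of a fixed size. Any countable 
set has such a size function. We will define two auxiliary embeddings, $\chi$ and $\alpha$, 
and will construct the requisite $\phi$ as their direct sum. For intuition, it is helpful 
to keep in mind the dual roles of $X$ and $C$. Fix a target $\hat{c} \in C$. 

Define the embedding by instance $\chi : X \to \{0,1\}^X$ by 

\[ [\chi(x)]_u = 1_{\{x = u\}}, \qquad u \in X; \]

obviously, $\|\chi(x)\|_0 = 1$ for all $x \in X$. Define the corresponding hyperplane 
$w_\chi \in \mathbb{R}^X$ by 

\[ [w_\chi]_u = 1_{\{u \in \hat{c}\}} 1_{\{|u| < \|\hat{c}\|\}}, \qquad u \in X; \]

since size functions have finite level sets, we have $\|w_\chi\|_0 < \infty$. Thus, 

\[
\begin{aligned}
\langle w_\chi, \chi(x) \rangle &= \sum_{u \in X} [w_\chi]_u [\chi(x)]_u \\
&= \sum_{u \in X} 1_{\{u \in \hat{c}\}} 1_{\{|u| < \|\hat{c}\|\}} 1_{\{x = u\}} \\
&= 1_{\{x \in \hat{c}\}} 1_{\{|x| < \|\hat{c}\|\}}.
\end{aligned}
\tag{1}
\]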

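Define the embedding by concept $\alpha : X \to \{0,1\}^C$ by 

\[ [\alpha(x)]_c = 1_{\{x \in c\}} 1_{\{\|c\| \le |x|\}}, \qquad c \in C; \]

since size functions have finite level sets, we have $\|\alpha(x)\|_0 < \infty$. The 
corresponding hyperplane $w_\alpha \in \mathbb{R}^C$ is defined by 

\[ [w_\alpha]_c = 1_{\{c = \hat{c}\}}, \qquad c \in C. \]

Now 

\[
\begin{aligned}
\langle w_\alpha, \alpha(x) \rangle &= \sum_{c \in C} [w_\alpha]_c [\alpha(x)]_c \\
&= \sum_{c \in C} 1_{\{c = \hat{c}\}} 1_{\{x \in c\}} 1_{\{\|c\| \le |x|\}} \\
&= 1_{\{x \in \hat{c}\}} 1_{\{|x| \ge \|\hat{c}\|\}}.
\end{aligned}
\tag{2}
\]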

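We define the canonical embedding $\phi : X \to \{0,1\}^{\mathbb{N}}$ as the direct sum of the 
embeddings by instance and concept: 

\[ \phi(x) = \chi(x) \oplus \alpha(x); \]

note that 

\[ \|\phi(x)\|_0 = \|\chi(x)\|_0 + \|\alpha(x)\|_0 < \infty. \]

Similarly, the corresponding hyperplane is the direct sum of the two hyperplanes: 

\[ w = w_\chi \oplus w_\alpha; \]

again, 

\[ \|w\|_0 = \|w_\chi\|_0 + \|w_\alpha\|_0 < \infty. \]

Combining (1) and (2), we get 

\[
\begin{aligned}
\langle w, \phi(x) \rangle &= \langle w_\chi, \chi(x) \rangle + \langle w_\alpha, \alpha(x) \rangle \\
&= 1_{\{x \in \hat{c}\}} 1_{\{|x| < \|\hat{c}\|\}} + 1_{\{x \in \hat{c}\}} 1_{\{|x| \ge \|\hat{c}\|\}} \\
&= 1_{\{x \in \hat{c}\}},
\end{aligned}
\]

which shows that $w$ is indeed a linear separator (with finite support) for $\hat{c}$. □ 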
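
The construction in the proof can be checked mechanically on a small finite example. The following is a minimal sketch, not part of the original argument: the instance space (strings over {a, b} of length at most 3), the three toy concepts, and their sizes are all assumptions chosen for illustration, and vectors are stored sparsely as sets of active coordinates.

```python
from itertools import product

# Toy instance space (assumption): strings over {a, b} of length <= 3, size |x| = len(x).
X = ["".join(p) for k in range(4) for p in product("ab", repeat=k)]

# Toy concept class (assumption): name -> (size ||c||, membership predicate).
C = {
    "starts_with_a": (1, lambda s: s.startswith("a")),
    "even_length": (2, lambda s: len(s) % 2 == 0),
    "contains_bb": (3, lambda s: "bb" in s),
}

def phi(x):
    """Active coordinates of the direct sum: embedding by instance plus by concept."""
    coords = {("x", x)}  # embedding by instance: the single coordinate u = x
    for name, (size, member) in C.items():
        if member(x) and size <= len(x):  # embedding by concept: x in c and ||c|| <= |x|
            coords.add(("c", name))
    return coords

def weight(target):
    """Active coordinates of the separating hyperplane for the target concept."""
    size, member = C[target]
    w = {("x", u) for u in X if member(u) and len(u) < size}  # u in target, |u| < ||target||
    w.add(("c", target))  # the single coordinate c = target
    return w

def score(target, x):
    """Inner product <w, phi(x)> of two 0/1 vectors given by their supports."""
    return len(weight(target) & phi(x))

for name, (size, member) in C.items():
    for x in X:
        assert score(name, x) == (1 if member(x) else 0)
```

Short strings in the target are picked up by the instance coordinates, and longer ones by the single concept coordinate, so each member scores exactly 1.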

4 Universal regular kernel 

To apply Theorem 3.1 to regular languages (over a fixed alphabet $\Sigma$), we observe 
that the DFAs are a countable concept class $\mathcal{R} = \bigcup_{n \ge 1} \mathrm{DFA}(n)$ over $X = \Sigma^*$, 
where $\mathrm{DFA}(n)$ is the set of all DFAs on $n$ states. Denoting by $\|A\|$ the number 
of states in $A \in \mathcal{R}$, we see that $\|\cdot\|$ is a valid size function on $\mathcal{R}$. A natural size 
function on $\Sigma^*$ is string length, denoted by $|\cdot|$. With these two size functions, 
Theorem 3.1 furnishes an embedding $\phi : \Sigma^* \to \{0,1\}^{\mathbb{N}}$ that renders all regular 
languages linearly separable. To get a better feel for this embedding, let us 
compute its associated kernel 

\[
K(x,y) = \langle \phi(x), \phi(y) \rangle = 1_{\{x = y\}} + \sum_{n=1}^{\min\{|x|,|y|\}} K_n(x,y),
\]

where 

\[
K_n(x,y) = \sum_{A \in \mathrm{DFA}(n)} 1_{\{x \in L(A)\}} 1_{\{y \in L(A)\}}. \tag{3}
\]

In other words, $K_n(x,y)$ counts the number of $n$-state DFAs that accept both 
$x$ and $y$. By [5, Theorem 6], an immediate consequence of this construction 
is that every regular language $L$ can be represented by some support strings 
$\{s_i \in \Sigma^* : 1 \le i \le m\}$ with weights $\alpha \in \mathbb{R}^m$: 

\[
L = \Bigl\{ x \in \Sigma^* : \sum_{i=1}^{m} \alpha_i K(s_i, x) > 0 \Bigr\}.
\]


5 Computing K n 

Since the summation in (3) involves a super-exponential number of terms, 
brute-force evaluation is out of the question. Though we consider computing 
$K_n$ to be a likely candidate for #P-completeness, we have no proof of this; there 
is also the hope that the symmetry in the problem will enable a clever efficient 
computation. 

In the meantime, we must resort to a Monte Carlo simulation. For $n \ge 1$ 
and $x, y \in \Sigma^*$, define $P_n(x,y)$ to be the fraction of all the DFAs on $n$ states 
that accept both $x$ and $y$. Thus, $0 \le P_n(x,y) \le 1$, and computing this quantity 
is tantamount to computing $K_n(x,y) = P_n(x,y)\,|\mathrm{DFA}(n)|$. Now it is a simple 
matter to generate $n$-state DFAs uniformly at random. Let $\{A_i : 1 \le i \le m\}$ be 
such an independent sample of $m$ DFAs on $n$ states, and compute the approximation 
to $P_n(x,y)$: 

\[
\hat{P}_n(x,y) = \frac{1}{m} \sum_{i=1}^{m} 1_{\{x \in L(A_i)\}} 1_{\{y \in L(A_i)\}}.
\]



Then, by Chernoff's bound, we have 

\[
\mathbb{P}\bigl\{ |\hat{P}_n(x,y) - P_n(x,y)| > \epsilon P_n(x,y) \bigr\} \le 2\exp\bigl(-\epsilon^2 m P_n(x,y)/3\bigr),
\]

meaning that with probability at least $1 - 2\exp(-\epsilon^2 m P_n(x,y)/3)$, we have 

\[
(1 - \epsilon) K_n(x,y) \le \hat{K}_n(x,y) \le (1 + \epsilon) K_n(x,y),
\]

where $\hat{K}_n(x,y) = \hat{P}_n(x,y)\,|\mathrm{DFA}(n)|$. Thus, we need 

\[
m \ge \frac{3 \log(2/\alpha)}{\epsilon^2 P_n(x,y)}
\]

sampling steps to have an $\epsilon$-approximation to $K_n(x,y)$ with probability at least 
$1 - \alpha$. 

It remains to lower-bound $P_n(x,y)$; if it turns out to be exponentially small 
in automaton size $n$, the $\epsilon$-approximation will require exponentially many steps. 
Fortunately, this does not happen: 

Theorem 5.1. For all $n \ge 1$, for all $x, y \in \Sigma^*$, we have 

\[ \frac{1}{4} \le P_n(x,y) \le \frac{1}{2}. \]

Proof. The upper bound is simple: it follows from the fact that $K_n(x,x) = 
\frac{1}{2} |\mathrm{DFA}(n)|$. Indeed, for any $x \in \Sigma^*$, for every $A^+ \in \mathrm{DFA}(n)$ that accepts $x$ 
there is exactly one $A^- \in \mathrm{DFA}(n)$ that does not (obtained by changing the 
state in which $A^+$ ends up after reading $x$ from accepting to non-accepting). 
The upper bound follows from the obvious relation $K_n(x,y) \le K_n(x,x)$ for all 
$x, y \in \Sigma^*$. 

To prove the lower bound, take the "worst" case where $x, y \in \Sigma^*$ are such 
that every $A \in \mathrm{DFA}(n)$ has $\delta(q_0, x) \ne \delta(q_0, y)$. In other words, no automaton 
ends up in the same state after reading $x$ as it does after reading $y$. Since 
every state is independently chosen to be accepting or not with equal probability, 
exactly one-fourth of all $A \in \mathrm{DFA}(n)$ will accept both $x$ and $y$. Clearly, this 
fraction will be higher if we allow some automata to end up in the same state 
upon reading $x$ and $y$. □ 
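
The two cases in the proof can be combined into one identity. As a sketch, write $q_n(x,y)$ (a symbol introduced here for illustration, not used in the original) for the fraction of automata in $\mathrm{DFA}(n)$ whose runs on $x$ and $y$ end in the same state, and assume, as in the proof, that each state is accepting independently with probability one half:

```latex
\[
P_n(x,y) \;=\; \tfrac{1}{2}\, q_n(x,y) \;+\; \tfrac{1}{4}\,\bigl(1 - q_n(x,y)\bigr)
         \;=\; \tfrac{1}{4}\,\bigl(1 + q_n(x,y)\bigr),
\qquad 0 \le q_n(x,y) \le 1 .
\]
```

Taking $q_n(x,y) = 0$ gives the lower bound and $q_n(x,y) = 1$, which holds when $x = y$, gives the upper bound.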

This means that if we run the (very simple and efficient) simulation algorithm 
for $m = 12 \epsilon^{-2} \log(2/\alpha)$ steps, we will have an $\epsilon$-approximation to $K_n(x,y)$ with 
probability at least $1 - \alpha$. 
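
The sampling procedure described in this section might be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the alphabet is taken to be binary, DFA(n) is taken to be the labeled automata on states {0, ..., n-1} with start state 0, and a uniform sample is drawn by choosing every transition and every accepting bit independently. All function names are hypothetical.

```python
import math
import random

SIGMA = "ab"  # assumed alphabet

def sample_size(eps, alpha):
    """m = 12 * eps^-2 * log(2 / alpha), rounded up to an integer."""
    return math.ceil(12 * math.log(2 / alpha) / eps ** 2)

def num_dfas(n):
    """|DFA(n)| under the assumed labeled convention: n^(n |SIGMA|) * 2^n."""
    return n ** (n * len(SIGMA)) * 2 ** n

def random_dfa(n, rng):
    """Draw transitions and accepting bits independently and uniformly."""
    delta = {(q, a): rng.randrange(n) for q in range(n) for a in SIGMA}
    accepting = [rng.random() < 0.5 for _ in range(n)]
    return delta, accepting

def accepts(dfa, s):
    """Run the automaton on s from the assumed start state 0."""
    delta, accepting = dfa
    q = 0
    for ch in s:
        q = delta[(q, ch)]
    return accepting[q]

def estimate_P_n(n, x, y, eps=0.1, alpha=0.05, seed=0):
    """Monte Carlo estimate of P_n(x, y) from m independent random DFAs."""
    rng = random.Random(seed)
    m = sample_size(eps, alpha)
    hits = 0
    for _ in range(m):
        A = random_dfa(n, rng)
        if accepts(A, x) and accepts(A, y):
            hits += 1
    return hits / m

def estimate_K_n(n, x, y, **kwargs):
    """Scale the estimated fraction by the number of n-state DFAs."""
    return estimate_P_n(n, x, y, **kwargs) * num_dfas(n)

print(sample_size(0.1, 0.05), estimate_P_n(4, "ab", "bba"))
```

The sample size depends only on $\epsilon$ and $\alpha$, not on $n$ or on the strings, which is what Theorem 5.1 buys; the cost per sample is linear in $n$ plus the string lengths.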

6 Conclusion 

Many fascinating questions arise naturally around the kernel $K_n$ that we defined: 
Is it (or any other universal regular kernel) efficiently computable? How can one 
efficiently recover the automaton from the hyperplane? Can quantitative margin 
bounds be obtained (perhaps in terms of automaton size)? These questions hold 
potential for promising future research. 